
Speech-to-Text API

This page outlines the fundamentals of using the Speech-to-Text API: the types of requests you can make, how to construct those requests, and how to handle their responses. We recommend reading this page in full before diving into the Speech API.

Speech Requests

Speech-to-Text has two main methods of performing speech recognition:

Synchronous Requests

With synchronous requests (REST), audio data is sent to the Speech-to-Text API, recognition is performed on that data, and results are returned once all audio has been processed. Synchronous recognition requests are limited to audio data of 1 minute or less in duration.

| Request Type | Audio Length Limit |
| --- | --- |
| Synchronous Request | ≤ 60 seconds |
| Asynchronous Request | ≤ 400 minutes |

Supported formats

  • File Type - We currently support only wav, amr, flac, and ogg audio files.

  • Sample Rate - We support all sample rates between 8 000 Hz and 48 000 Hz. If you can choose the sample rate of the source, record the audio at 16 000 Hz: lower sample rates may reduce the accuracy of our models, while sample rates above 16 000 Hz have no significant impact on accuracy.
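As a quick sanity check before uploading, the supported range above can be encoded in a few lines of Python. This is a sketch; the commented-out file path is illustrative.

```python
SUPPORTED_RANGE = (8000, 48000)  # supported sample rates, in Hz
RECOMMENDED_RATE = 16000         # recommended recording rate, in Hz

def check_sample_rate(rate_hz):
    """Classify a sample rate against the limits documented above."""
    low, high = SUPPORTED_RANGE
    if not low <= rate_hz <= high:
        return "unsupported"
    if rate_hz < RECOMMENDED_RATE:
        return "supported, but may reduce accuracy"
    return "supported"

print(check_sample_rate(16000))
print(check_sample_rate(8000))

# To check a local file with the standard library:
# import wave
# with wave.open("recording.wav", "rb") as f:
#     print(check_sample_rate(f.getframerate()))
```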


Synchronous Request

Synchronous recognition requests are the simplest means of performing recognition on speech audio data. The Speech-to-Text API can process up to 1 minute of speech audio data sent in a synchronous request. After the Speech-to-Text API processes and recognizes all of the audio, it returns a response. A sample request is shown in the section that follows:

Endpoint: /asr

https://api.botlhale.xyz/asr
Tip: You need to include an authentication token in the request headers. See the Authentication page of this documentation for information on how to generate authentication tokens.

Method: POST

This endpoint processes speech files for Automatic Speech Recognition (ASR). It transcribes spoken language into text and returns the transcription. The audio file is also temporarily stored and uploaded to an S3 bucket, with the S3 file name included in the response.

Authentication

A valid Bearer token must be included in the request headers for authentication.

Headers:

  • Authorization: Bearer <your_token>

Form Arguments

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| speech_file | File | Required | The audio file to be transcribed. |
| redact | bool | Optional | Whether to redact sensitive information (for example, names and locations) in the transcription. |
| language_code | String | Optional | The language code of the spoken language in the audio file. If not provided, automatic language detection will be attempted. |

Response body

The API returns a JSON object with the following structure:

```json
{
  "transcription": "Hello, how can I assist you?",
  "s3_filename": "uploads/audio_123456.wav",
  "date_received": "2025-01-28T10:00:00Z"
}
```

Fields:

| Field | Data Type | Description |
| --- | --- | --- |
| transcription | string | The transcribed text from the speech file. |
| s3_filename | string | The name of the uploaded file in the S3 bucket. |
| date_received | string | The timestamp when the request was processed, in ISO 8601 format. |
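Putting the endpoint, headers, and form arguments above together, a synchronous request might look like the sketch below. The token and file path are placeholders, the string encoding of the optional redact flag is an assumption, and the nso-ZA language code is taken from examples later on this page.

```python
def auth_headers(token):
    """Headers required by /asr (Bearer token, per the Authentication section)."""
    return {"Authorization": f"Bearer {token}"}

def build_asr_form(redact=False, language_code=None):
    """Optional form fields for /asr; omit language_code to use auto-detection."""
    form = {"redact": str(redact).lower()}  # assumed "true"/"false" encoding
    if language_code:
        form["language_code"] = language_code
    return form

def transcribe(path, token, language_code=None, redact=False):
    """POST a local audio file to /asr and return the transcription."""
    import requests  # third-party: pip install requests
    with open(path, "rb") as f:
        response = requests.post(
            "https://api.botlhale.xyz/asr",
            headers=auth_headers(token),
            data=build_asr_form(redact=redact, language_code=language_code),
            files={"speech_file": (path, f, "audio/wav")},
        )
    response.raise_for_status()
    return response.json()["transcription"]

# Example (not run here):
# print(transcribe("recording.wav", "<IdToken>", language_code="nso-ZA"))
```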

Speech to Text endpoints (async)

Asynchronous Request

Asynchronous recognition requests are another means of performing recognition on speech audio data. This request type requires you to first upload the audio file to our server for the asynchronous process to start. The request initiates an asynchronous operation and returns immediately, before processing completes. Asynchronous speech recognition can be used for audio data with a length of up to 400 minutes.

Endpoint: /asr/async/upload

https://api.botlhale.xyz/asr/async/upload
Tip: You need to include an authentication token in the request headers. See the Authentication page of this documentation for information on how to generate authentication tokens.

Method: POST

This endpoint generates a presigned URL that allows users to upload a speech file for asynchronous Automatic Speech Recognition (ASR) processing. Once the file is uploaded, it will be processed asynchronously, and a notification can be sent to a specified URL when the transcription is complete.

Authentication

A valid Bearer token must be included in the request headers.

Headers:

  • Authorization: Bearer <your_token>

Form Arguments

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| org_id | String | Required | The unique identifier for the organization making the request. |
| language_code | String | Optional | The language spoken in the supplied audio clip. If not provided, the language will be auto-detected. |
| sample_rate | Integer | Optional, default: 16000 | The sample rate of the supplied audio clip in hertz. |
| diarization | Boolean | Optional, default: False | Whether to use speaker diarization to differentiate between multiple speakers. |
| voice_id | String | Optional | A unique identifier for the speaker, if applicable. |
| notify_url | String | Optional | A URL to notify once the ASR processing is complete. |

Response Body

The API returns a JSON object containing a presigned URL and the fields required for uploading the audio file:

```json
{
  "upload_url": "https://s3-bucket-url.com/presigned-upload-link",
  "fields": {
    "key": "asr_uploads/audio_123456.wav",
    "AWSAccessKeyId": "AKIA...",
    "policy": "base64-encoded-policy",
    "signature": "signature-string"
  },
  "expires_in": 3600
}
```
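The form arguments above can be assembled into a request for a presigned URL as in the sketch below. The token and org ID are placeholders, and the string encoding of `sample_rate` and `diarization` is an assumption.

```python
def build_upload_form(org_id, language_code=None, sample_rate=16000,
                      diarization=False, voice_id=None, notify_url=None):
    """Form fields for /asr/async/upload, omitting unset optional values."""
    form = {"org_id": org_id, "sample_rate": str(sample_rate),
            "diarization": str(diarization)}  # assumed string encoding
    optional = {"language_code": language_code, "voice_id": voice_id,
                "notify_url": notify_url}
    form.update({k: v for k, v in optional.items() if v})
    return form

def request_presigned_upload(token, org_id, **options):
    """Ask /asr/async/upload for a presigned S3 upload URL."""
    import requests  # third-party: pip install requests
    response = requests.post(
        "https://api.botlhale.xyz/asr/async/upload",
        headers={"Authorization": f"Bearer {token}"},
        data=build_upload_form(org_id, **options),
    )
    response.raise_for_status()
    return response.json()  # contains upload_url, fields, expires_in

# Example (not run here):
# presigned = request_presigned_upload("<IdToken>", "<OrgID>", diarization=True)
# print(presigned["upload_url"], presigned["expires_in"])
```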

Upload via Presigned URL

The generated presigned URL includes both a URL and additional fields that must be passed as part of the subsequent HTTP POST request. The following code demonstrates how to use the requests package with a presigned POST URL to perform a POST request for file upload.

Form Arguments

| Request Params | Data Type | Required | Example / Description |
| --- | --- | --- | --- |
| policy | String | Required | eyJleHBpcmF0aW9uIjogIjIwMjUtMDItMjFUMDc6MTc6MzRa.... |
| x-amz-algorithm | String | Required | AWS4-HMAC-SHA256 |
| x-amz-credential | String | Required | ASIA2ADMPV7EBIIIA3UR/20250221/eu-west-1/s3/aws4_request |
| x-amz-date | String | Required | 20250221T061734Z |
| x-amz-security-token | String | Required | IQoJb3JpZ2luX2VjEKf//////////wEaCWV1LXdlc3QtMSJHMEUC... |
| x-amz-signature | String | Required | e3cca032a465e57b837b5d2... |
| file | File | Required | The audio file to upload. |

Request Example

```python
import requests

url = "{{uploadUrl}}"

payload = {
    'AWSAccessKeyId': '{{fields-AWSAccessKeyId}}',
    'key': '{{fields-key}}',
    'policy': '{{fields-policy}}',
    'signature': '{{fields-signature}}',
    'x-amz-security-token': '{{fields-x-amz-security-token}}'
}
files = [
    ('file', ('tts_aw215n3s4ni4_IsiZulu_H127Bqf8aN08.wav',
              open('KpALthHva/tts_aw215n3s4ni4_IsiZulu_H127Bqf8aN08.wav', 'rb'),
              'audio/wav'))
]

response = requests.post(url, data=payload, files=files)
print(response.text)
```

Endpoint: /asr/async/status

https://api.botlhale.xyz/asr/async/status
Tip: You need to include an authentication token in the request headers. See the Authentication page of this documentation for information on how to generate authentication tokens.

Method: GET

This endpoint returns the status of the asynchronous recognition process.

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| OrgID | String | Required | Organisation ID. |
| FileName | String | Required | The filename generated from the async upload process. |

Request Example

```python
import requests

url = "https://api.botlhale.xyz/asr/async/status?OrgID=<OrgID>&FileName=<filename>"

headers = {
    'Authorization': 'Bearer <IdToken>'
}

response = requests.get(url, headers=headers)
print(response.json())
```

Response body

```json
{
  "data": [
    {
      "OrgID": "<OrgID>",
      "id": 207891841473145364,
      "process": "<filename>.wav",
      "processTime": "processTime",
      "status": "running"
    }
  ]
}
```

Endpoint: /asr/async/data

https://api.botlhale.xyz/asr/async/data
Tip: You need to include an authentication token in the request headers. See the Authentication page of this documentation for information on how to generate authentication tokens.

Method: GET

This endpoint returns the results of the asynchronous recognition process.

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| OrgID | String | Required | Organisation ID. |
| FileName | String | Required | The filename generated from the async upload process. |

Request Example

```python
import requests

url = "https://api.botlhale.xyz/asr/async/getdata?OrgID=<OrgID>&FileName=<filename>"

headers = {
    'Authorization': 'Bearer <IdToken>'
}

response = requests.get(url, headers=headers)
print(response.json())
```

Response body

```json
{
  "audio_length": "30.0",
  "filename": "/<filename>.wav",
  "status": "complete",
  "time": {
    "diarization": 6.815945625305176,
    "recognition": 4.098539113998413
  },
  "timestamps": [
    {
      "end": 1260.0000000000005,
      "filename": "1_speaker_0_660.0000000000003_1260.0000000000005_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 660.0000000000003,
      "transcription": "<transcription>"
    },
    {
      "end": 2310.0000000000014,
      "filename": "2_speaker_1_1260.000000000001_2310.0000000000014_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 1260.000000000001,
      "transcription": "<transcription>"
    },
    {
      "end": 2699.9999999999995,
      "filename": "3_speaker_0_2309.9999999999995_2699.9999999999995_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 2309.9999999999995,
      "transcription": "<transcription>"
    },
    {
      "end": 6359.999999999998,
      "filename": "4_speaker_1_2699.9999999999973_6359.999999999998_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 2699.9999999999973,
      "transcription": "<transcription>"
    },
    {
      "end": 6780.000000000008,
      "filename": "5_speaker_0_6360.000000000008_6780.000000000008_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 6360.000000000008,
      "transcription": "<transcription>"
    },
    {
      "end": 7860.000000000012,
      "filename": "6_speaker_1_6780.000000000012_7860.000000000012_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 6780.000000000012,
      "transcription": "<transcription>"
    },
    {
      "end": 8580.000000000022,
      "filename": "7_speaker_0_7860.000000000021_8580.000000000022_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 7860.000000000021,
      "transcription": "<transcription>"
    },
    {
      "end": 13950.000000000011,
      "filename": "8_speaker_1_8580.00000000001_13950.000000000011_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 8580.00000000001,
      "transcription": "<transcription>"
    },
    {
      "end": 15239.999999999889,
      "filename": "9_speaker_1_14249.999999999887_15239.999999999889_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 14249.999999999887,
      "transcription": "<transcription>"
    },
    {
      "end": 15929.999999999867,
      "filename": "10_speaker_0_15239.999999999867_15929.999999999867_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 15239.999999999867,
      "transcription": "<transcription>"
    },
    {
      "end": 18629.999999999854,
      "filename": "11_speaker_1_15929.999999999853_18629.999999999854_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 15929.999999999853,
      "transcription": "<transcription>"
    },
    {
      "end": 19739.99999999995,
      "filename": "12_speaker_0_18629.99999999995_19739.99999999995_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 18629.99999999995,
      "transcription": "<transcription>"
    },
    {
      "end": 21839.999999999993,
      "filename": "13_speaker_1_19739.999999999993_21839.999999999993_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 19739.999999999993,
      "transcription": "<transcription>"
    },
    {
      "end": 22410.000000000073,
      "filename": "14_speaker_0_21840.00000000007_22410.000000000073_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 21840.00000000007,
      "transcription": "<transcription>"
    },
    {
      "end": 24360.00000000009,
      "filename": "15_speaker_1_22410.00000000009_24360.00000000009_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 22410.00000000009,
      "transcription": "<transcription>"
    },
    {
      "end": 25590.000000000167,
      "filename": "16_speaker_0_24360.000000000167_25590.000000000167_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 24360.000000000167,
      "transcription": "<transcription>"
    },
    {
      "end": 26430.000000000215,
      "filename": "17_speaker_1_25590.000000000215_26430.000000000215_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 25590.000000000215,
      "transcription": "<transcription>"
    },
    {
      "end": 28380.000000000244,
      "filename": "18_speaker_0_26430.000000000244_28380.000000000244_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 26430.000000000244,
      "transcription": "<transcription>"
    },
    {
      "end": 29220.00000000032,
      "filename": "19_speaker_1_28380.00000000032_29220.00000000032_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_1",
      "start": 28380.00000000032,
      "transcription": "<transcription>"
    },
    {
      "end": 30000.000000000353,
      "filename": "20_speaker_0_29220.00000000035_30000.000000000353_nso-ZA.wav",
      "language": "nso-ZA",
      "speaker": "speaker_0",
      "start": 29220.00000000035,
      "transcription": "<transcription>"
    }
  ]
}
```

Speech to Text endpoints (async)

Endpoint: /asr/async/upload

Method: POST

This endpoint generates a presigned URL that allows users to upload a speech file for asynchronous Automatic Speech Recognition (ASR) processing. Once the file is uploaded, it will be processed asynchronously, and a notification can be sent to a specified URL when the transcription is complete.

Authentication

A valid Bearer token must be included in the request headers.

Headers:

  • Authorization: Bearer <your_token>

Form Arguments

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| org_id | string | Required | The unique identifier for the organization making the request. |
| language_code | string | Optional | The language spoken in the supplied audio clip. If not provided, the language will be auto-detected. |
| sample_rate | integer | Optional, default: 16000 | The sample rate of the supplied audio clip in hertz. |
| diarization | bool | Optional, default: False | Whether to use speaker diarization to differentiate between multiple speakers. |
| voice_id | string | Optional | A unique identifier for the speaker, if applicable. |
| notify_url | string | Optional | A URL to notify once the ASR processing is complete. |

Response

The API returns a JSON object containing a presigned URL and the required fields for uploading the audio file.

Example Response:

```json
{
  "upload_url": "https://s3-bucket-url.com/presigned-upload-link",
  "fields": {
    "key": "asr_uploads/audio_123456.wav",
    "AWSAccessKeyId": "AKIA...",
    "policy": "base64-encoded-policy",
    "signature": "signature-string"
  },
  "expires_in": 3600
}
```


Response Fields:

| Field | Data Type | Description |
| --- | --- | --- |
| upload_url | string | The presigned S3 URL where the speech file should be uploaded. |
| fields | dictionary | Contains additional parameters required for the file upload, including authentication credentials. |
| expires_in | integer | The number of seconds before the presigned URL expires (default: 3600, i.e. 1 hour). |
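If you pass a notify_url when requesting the upload, you will need an HTTP endpoint to receive the completion callback. The sketch below is a minimal receiver using only the Python standard library; the callback payload format is not documented on this page, so the handler simply logs whatever JSON body arrives (an assumption).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_notification(body):
    """Decode a callback body as JSON, falling back to raw text."""
    try:
        return json.loads(body or b"{}")
    except json.JSONDecodeError:
        return {"raw": body.decode("utf-8", "replace")}

class NotifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_notification(self.rfile.read(length))
        print("ASR notification:", payload)
        self.send_response(200)  # acknowledge receipt
        self.end_headers()

# Example (not run here): listen on the port your notify_url points at.
# HTTPServer(("0.0.0.0", 8080), NotifyHandler).serve_forever()
```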

Endpoint: /asr/async/status

Method: GET

This endpoint retrieves the status of an asynchronous ASR (Automatic Speech Recognition) process and returns the results if the process is completed.

Authentication

A valid Bearer token must be included in the request headers.

Headers:

  • Authorization: Bearer <your_token>

Query Parameters

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| org_id | string | Required | The organization ID associated with the request. |
| filename | string | Required | The filename generated during the async ASR upload process. |

Response

The API returns a JSON object containing the status of the process and the results if available.

Example Response (Running):

```json
{
  "status": "running",
  "location": "location",
  "inference_id": "org_98765"
}
```

Example Response (Completed):

```json
{
  "status": "completed",
  "location": "location",
  "inference_id": "org_98765"
}
```
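Rather than calling the status endpoint in a tight loop, clients can poll with a capped exponential backoff. This is a sketch: the delay values are assumptions, and the "completed" status string follows the example above.

```python
import time

def backoff_delays(initial=2.0, factor=2.0, cap=30.0):
    """Yield an unbounded sequence of exponentially growing, capped delays."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)

def poll_until_complete(fetch_status, max_attempts=20, sleep=time.sleep):
    """Call fetch_status() until it returns "completed"; True on success."""
    for attempt, delay in zip(range(max_attempts), backoff_delays()):
        if fetch_status() == "completed":
            return True
        sleep(delay)
    return False

# Example (not run here), using requests and the status endpoint above:
# import requests
# def fetch_status():
#     url = "https://api.botlhale.xyz/asr/async/status?OrgID=<OrgID>&FileName=<filename>"
#     return requests.get(url, headers={"Authorization": "Bearer <IdToken>"}).json()["status"]
# poll_until_complete(fetch_status)
```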

Endpoint: /asr/async/data

Method: GET

This endpoint retrieves the status and detailed results of an asynchronous ASR (Automatic Speech Recognition) process, including transcription, speaker diarization, timestamps, and redacted versions of speech segments.

Authentication

A valid Bearer token must be included in the request headers.

Headers:

  • Authorization: Bearer <your_token>

Query Parameters

| Request Params | Data Type | Required | Description |
| --- | --- | --- | --- |
| org_id | string | Required | The organization ID associated with the request. |
| filename | string | Required | The filename generated during the async ASR upload process. The format should be `OrgID/filename.wav`. |

Response

The API returns a JSON object containing metadata about the processed audio and detailed transcription data.

Example Response (Completed)

```json
{
  "audio_length": 293.52,
  "date_received": "29/01/2025 23:04:28",
  "filename": "asr_618Ilr3ux6b7__16000_BotlhaleAI999__True_https:******botlhaleai**free**beeceptor**com_3M50XY73w6LS29012025_230351",
  "time": {
    "diarization": 11.390989780426025,
    "recognition": 84.02089238166809
  },
  "timestamps": [
    {
      "emotion": "neu",
      "end": 4083.191850594227,
      "filename": "0_SPEAKER_00_1112.0543293718165_4083.191850594227.wav",
      "language": "English",
      "redaction": "Hello, good day. You're speaking to <PERSON> from <LOCATION>. How are you doing today?",
      "speaker": "SPEAKER_00",
      "start": 1112.0543293718165,
      "times": {
        "asr": 1.1830341815948486,
        "emotion": 4.76837158203125e-07,
        "red": 0.024340391159057617,
        "sli": 0.01774907112121582
      },
      "transcription": "Hello, good day. You're speaking to Nick from Kuru. How are you doing today?",
      "transcription_no_LM": "",
      "translation": "-"
    },
    {
      "emotion": "neu",
      "end": 292979.6264855688,
      "filename": "80_SPEAKER_01_291281.83361629886_292979.6264855688.wav",
      "language": "English",
      "redaction": "Thank you.",
      "speaker": "SPEAKER_01",
      "start": 291281.83361629886,
      "times": {
        "asr": 0.9126956462860107,
        "emotion": 4.76837158203125e-07,
        "red": 0.020905494689941406,
        "sli": 0.01774907112121582
      },
      "transcription": "Thank you.",
      "transcription_no_LM": "",
      "translation": "-"
    }
  ]
}
```

Response Fields

General Metadata

| Field | Data Type | Description |
| --- | --- | --- |
| audio_length | float | The total duration of the audio file in seconds. |
| date_received | string | The date and time when the request was received. |
| filename | string | The filename associated with the ASR request. |
| time | dictionary | Processing time details: diarization (float), time spent on speaker diarization; recognition (float), time spent on speech recognition. |

Timestamps (List of Speech Segments)

Each object in the timestamps list represents a spoken segment with the following details:

| Field | Data Type | Description |
| --- | --- | --- |
| start | float | Start time (milliseconds). |
| end | float | End time (milliseconds). |
| speaker | string | Identified speaker ID (SPEAKER_00, SPEAKER_01, etc.). |
| filename | string | The audio snippet filename corresponding to this segment. |
| language | string | The detected language of the speech. |
| emotion | string | The predicted emotion (e.g., "neu" for neutral). |
| redaction | string | The redacted version of the speech, replacing sensitive data (`<PERSON>`, `<LOCATION>`). |
| transcription | string | The full transcript of the spoken segment. |
| transcription_no_LM | string | A version of the transcript without language model post-processing. |
| translation | string | Translation of the speech (if applicable). |

Processing Time per Segment

| Field | Data Type | Description |
| --- | --- | --- |
| asr | float | Time taken for automatic speech recognition. |
| emotion | float | Time taken for emotion detection. |
| red | float | Time taken for redaction processing. |
| sli | float | Time taken for spoken language identification. |
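The timestamps list lends itself to post-processing. As a minimal sketch, the function below orders segments by start time and renders a speaker-labelled transcript; the field names follow the tables above, and the sample data is invented.

```python
def to_transcript(timestamps):
    """Render timestamp segments as '[mm:ss.ss] speaker: text' lines, in time order."""
    lines = []
    for seg in sorted(timestamps, key=lambda s: s["start"]):
        seconds = seg["start"] / 1000.0  # start/end are in milliseconds
        stamp = f"{int(seconds // 60):02d}:{seconds % 60:05.2f}"
        lines.append(f"[{stamp}] {seg['speaker']}: {seg['transcription']}")
    return "\n".join(lines)

# Invented sample data in the shape of the "timestamps" entries above.
sample = [
    {"start": 4083.2, "speaker": "SPEAKER_01", "transcription": "I'm well, thank you."},
    {"start": 1112.1, "speaker": "SPEAKER_00", "transcription": "Hello, good day."},
]
print(to_transcript(sample))
```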

Contact us


We are here to help! Please contact us with any questions.